1 Executive Summary

Engineering majors dominate the top of the income rankings, led by Petroleum Engineering, although Engineering incomes also show a larger spread than the other categories examined. It is important to note that this is an observational study, so only association can be inferred, not causation. The data cannot show that majoring in Engineering will cause one to have a larger income; it can only show that the two are associated.

2 Full Report

2.1 Initial Data Analysis (IDA)

2.1.1 Source

The data was collected from the American Community Survey 2010-2012 Public Use Microdata Sample (PUMS) files on the US Census Bureau website. It was initially wrangled by the American media company FiveThirtyEight (a part of ABC News Internet Ventures), which has made the code used to wrangle the data publicly available.

2.1.2 Stakeholders

According to its website, the Census Bureau produces the PUMS as an inexpensive and accessible data source for students, social scientists, and marketing analysts. The media company FiveThirtyEight sorted this data and created the dataset for use in its article “The Economic Guide to Picking a College Major”, aimed at educating students on how best to choose their own college majors.

While the Census Bureau is a government organisation and therefore can be assumed to be unbiased, FiveThirtyEight’s article was commercially motivated, as the website runs advertisements. However, FiveThirtyEight’s interpretation of the data was quite transparent, given that they provided the code and made minimal changes to the actual content of the data. The data can therefore be considered unbiased and valid.

2.1.3 Data Dictionary

str(gradData)
## 'data.frame':    173 obs. of  21 variables:
##  $ Rank                : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Major_code          : int  2419 2416 2415 2417 2405 2418 6202 5001 2414 2408 ...
##  $ Major               : Factor w/ 173 levels "ACCOUNTING","ACTUARIAL SCIENCE",..: 141 116 113 132 24 134 2 15 109 53 ...
##  $ Total               : int  2339 756 856 1258 32260 2573 3777 1792 91227 81527 ...
##  $ Men                 : int  2057 679 725 1123 21239 2200 2110 832 80320 65511 ...
##  $ Women               : int  282 77 131 135 11021 373 1667 960 10907 16016 ...
##  $ Major_category      : Factor w/ 16 levels "Agriculture & Natural Resources",..: 8 8 8 8 8 8 4 14 8 8 ...
##  $ ShareWomen          : num  0.121 0.102 0.153 0.107 0.342 ...
##  $ Sample_size         : int  36 7 3 16 289 17 51 10 1029 631 ...
##  $ Employed            : int  1976 640 648 758 25694 1857 2912 1526 76442 61928 ...
##  $ Full_time           : int  1849 556 558 1069 23170 2038 2924 1085 71298 55450 ...
##  $ Part_time           : int  270 170 133 150 5180 264 296 553 13101 12695 ...
##  $ Full_time_year_round: int  1207 388 340 692 16697 1449 2482 827 54639 41413 ...
##  $ Unemployed          : int  37 85 16 40 1672 400 308 33 4650 3895 ...
##  $ Unemployment_rate   : num  0.0184 0.1172 0.0241 0.0501 0.0611 ...
##  $ Median              : int  110000 75000 73000 70000 65000 65000 62000 62000 60000 60000 ...
##  $ P25th               : int  95000 55000 50000 43000 50000 50000 53000 31500 48000 45000 ...
##  $ P75th               : int  125000 90000 105000 80000 75000 102000 72000 109000 70000 72000 ...
##  $ College_jobs        : int  1534 350 456 529 18314 1142 1768 972 52844 45829 ...
##  $ Non_college_jobs    : int  364 257 176 102 4440 657 314 500 16384 10874 ...
##  $ Low_wage_jobs       : int  193 50 0 0 972 244 259 220 3253 3170 ...

This data consists of 20 variables (excluding “Rank”, which orders the subjects by median income); however, only xx variables are relevant for the study:

Major_code

A unique code for each major, given by the source.
Type: Integer
Assessment: Although it is a number, a factor classification would be more suitable as the codes are considered nominal (no order).
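
The reclassification suggested above could be sketched as follows. This is a minimal illustration only: the three-row data frame is a hypothetical stand-in for gradData, built from the first Major_code values shown in the str() output.

```r
# Hypothetical three-row stand-in for gradData (Major_code values copied from the str() output)
toy <- data.frame(Major_code = c(2419L, 2416L, 2415L))

# Treat the codes as nominal labels rather than as numbers
toy$Major_code <- factor(toy$Major_code)

# The column is now a factor with one level per distinct code
str(toy$Major_code)
```

Once the codes are a factor, numeric summaries such as mean() are no longer silently applied to them.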

Major

The major’s name.
Type: Factor
Assessment: Either a character or factor classification would be suitable.

Total, Men, Women

Total number of people, number of men, and number of women respectively with that major in the sample for 2010-2012.
Type: Integer
Assessment: Suitable.

Major_category

General category for that major (e.g. “Engineering”).
Type: Factor
Assessment: Suitable - allows for easy classification and plotting.

ShareWomen

Women as a proportion of Total (Women / Total), between 0 and 1.
Type: Number
Assessment: Suitable, since it is provided as a decimal (multiply by 100 if plotting percentages).
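
As a quick check of this definition, the proportion can be recomputed from the counts. The sketch below uses a hypothetical three-row data frame with values copied from the str() output; the PercentWomen column is an illustrative name, not part of the dataset.

```r
# Hypothetical three-row stand-in for gradData (values copied from the str() output)
toy <- data.frame(Total = c(2339, 756, 856),
                  Women = c(282, 77, 131))

# ShareWomen is a proportion between 0 and 1
toy$ShareWomen <- toy$Women / toy$Total

# Convert to a percentage only when it is needed for display or plotting
toy$PercentWomen <- 100 * toy$ShareWomen

# Rounded to three decimals for comparison with the str() output
print(round(toy$ShareWomen, 3))
```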

Sample_size

Sample size for calculating income quartiles.
Type: Integer
Assessment: Suitable.

Employed, Full_time, Part_time

Number of people employed, employed 35 hours or more per week, and employed fewer than 35 hours per week respectively.
Type: Integer
Assessment: Suitable.

Full_time_year_round

Number of people employed for at least 50 weeks per year and at least 35 hours per week.
Type: Integer
Assessment: Suitable.

Unemployed

Number of people considered unemployed by census data.
Type: Integer
Assessment: Suitable.

Unemployment_rate

The proportion of people unemployed, calculated as Unemployed / (Unemployed + Employed) and given as a decimal.
Type: Number
Assessment: Suitable.
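
The calculation can be reproduced from the counts as a sanity check. The sketch below uses a hypothetical three-row data frame with Unemployed and Employed values copied from the str() output.

```r
# Hypothetical three-row stand-in for gradData (values copied from the str() output)
toy <- data.frame(Employed   = c(1976, 640, 648),
                  Unemployed = c(37, 85, 16))

# Unemployment rate as a proportion of the labour force (employed + unemployed)
toy$Unemployment_rate <- toy$Unemployed / (toy$Unemployed + toy$Employed)

# Rounded to four decimals for comparison with the str() output
print(round(toy$Unemployment_rate, 4))
```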

Median, P25th, P75th

Median, 25th percentile, and 75th percentile earnings respectively for full-time, year-round workers (in USD).
Type: Integer
Assessment: Suitable - although income is continuous, it can be considered discrete without significantly impacting the data.

College_jobs, Non_college_jobs, Low_wage_jobs

Number of people with a job requiring a college degree, not requiring a college degree, and in a low-wage service job respectively.
Type: Integer
Assessment: Suitable.

2.1.4 Data Assessment

Possible Issues:

  • The data consists of pre-summarised information, meaning that each subject (major) is a set of medians, percentages, etc. Therefore, care needs to be taken in plotting and drawing conclusions.
  • The data comes from an observational study, with no control group and no randomised allocation between majors. Therefore, any conclusions drawn must be treated as association rather than causation.
  • Income has not been adjusted for inflation; however, this is not a major issue as we are concerned with relative comparisons.
  • Data has been pre-filtered to only include subjects below the age of 28, and can therefore be considered a sample, not a population. However, this can also be a positive in that it is more relevant to current university students.
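
To illustrate why pre-summarised data needs care, the sketch below compares a plain average of per-major medians with one weighted by the number of people in each major. It uses a hypothetical three-row data frame with values copied from the str() output. Note that neither figure is the median income of the pooled individuals, which cannot be recovered from summaries alone.

```r
# Hypothetical three-row stand-in for gradData (values copied from the str() output)
toy <- data.frame(Total  = c(2339, 756, 856),
                  Median = c(110000, 75000, 73000))

# Unweighted: every major counts equally, regardless of how many people it covers
unweighted <- mean(toy$Median)

# Weighted: larger majors contribute proportionally more
weighted <- weighted.mean(toy$Median, w = toy$Total)

print(c(unweighted = unweighted, weighted = weighted))
```

Here the two averages differ because the largest of the three majors also has the highest median, so the weighted figure is pulled upwards.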


Validity:
This data, taking into account the issues above and their solutions, can be considered valid.


2.1.5 Domain Knowledge

xxxxxx

2.2 New questions:

Q1: Which college majors give the highest incomes? Sub-question: How does this take into account spread (i.e. comparing the 75th percentile vs the median vs the 25th percentile)?

Q2: Which college majors give the highest employability (based on unemployment)? Sub-question: How much of this employment is actually in their field / based on the degree (looking at college jobs vs non-college jobs)?

Q3: Looking at the results from Q1 and Q2, how do these “rankings” align with the popularity of these courses (based on total people = employed + unemployed)?

2.3 Research Question 1

Which college major should a student take to receive the highest income?

There are three variables to consider - the 25th percentile, median, and 75th percentile incomes. Additionally, it is important to consider both individual majors and major categories. Taking a summary initially shows that there is a significant range of incomes:

summary(gradData$Median)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   22000   33000   36000   40151   45000  110000
plot_ly(gradData, y=~Median/1000, color=~Major_category, type="box") %>% 
  layout(    
    yaxis = list(title = "Median income (USD$1000)"),
    xaxis = list(showticklabels = FALSE),
    title = "Median Income per Major Category")

Plotting the median income against major category backs up the summary - showing a large spread, centred around the median of $36,000.

# Stacks Median, P25th, and P75th into a single Income column so that all three
# can be plotted on the same graph (three rows per major)
stack.incomes = function(df) {
  rbind(data.frame(Major = df$Major, Major_category = df$Major_category, Income = df$Median),
        data.frame(Major = df$Major, Major_category = df$Major_category, Income = df$P25th),
        data.frame(Major = df$Major, Major_category = df$Major_category, Income = df$P75th))
}
# Selects top 10 and bottom 10 (gradData is already ordered by median income)
gradData.head = head(gradData, n=10)
gradData.tail = tail(gradData, n=10)
# Combines the two
gradData.combined.df = rbind(stack.incomes(gradData.head), stack.incomes(gradData.tail))
# Repeats each major's row three times, in the same row order as gradData.combined.df,
# so that reorder() can sort the x-axis by descending median income
score = rbind(gradData.head, gradData.head, gradData.head,
              gradData.tail, gradData.tail, gradData.tail)
# Plots a boxplot
plot_ly(gradData.combined.df, y=~Income/1000, x=~reorder(Major, -score$Median), color=~Major_category, type="box") %>%
  layout(    
    yaxis = list(
      title = "Income (USD$1000)",
      autotick = FALSE,
      ticks = "outside",
      tick0 = 0,
      dtick = 10,
      ticklen = 3,
      tickwidth = 1),
    xaxis = list(showticklabels = FALSE, title=""),
    title = "Top 10 and Bottom 10 Majors by Median Income")

Looking at individual majors, there are initially too many data points to make sense of the information. Instead, because the data is ordered by median income, the subjects can be limited to only the top and bottom 10 majors (note that the plotted data includes the median, 25th percentile, and 75th percentile incomes). While 8 of the top 10 majors belong to the Engineering category, the bottom 10 majors are considerably more varied.


combined = rbind(gradData[gradData$Major_category=="Engineering",],
                 gradData[gradData$Major_category=="Education",], gradData[gradData$Major_category=="Business",])
median.df = data.frame(Major = combined$Major, Major_category=combined$Major_category, 
                       Income = combined$Median)
p25.df = data.frame(Major = combined$Major, Major_category=combined$Major_category, 
                       Income = combined$P25th)
p75.df = data.frame(Major = combined$Major, Major_category=combined$Major_category, 
                       Income = combined$P75th)
combined.df = rbind(median.df, p25.df, p75.df)
ggplotly(ggplot(combined.df, aes(x=Income/1000, fill=Major_category)) + geom_density(alpha=0.2) + 
           xlab("Income (USD$1000)") + ylab("Density") + labs(fill="Major Category", title="Income Density per Major Category") + theme_minimal())

Examining the density estimation of a selection of major categories, again Engineering appears to have significantly higher incomes compared to other categories. However, the estimation shows that the spread is also significantly larger, with a portion of the income falling within the range of the lowest majors. This is contrasted with Education, where the range is confined to ~$25,000.

# Coefficient of Variation for the Engineering sample's incomes
sd(combined.df[combined.df$Major_category=="Engineering",]$Income)/
  mean(combined.df[combined.df$Major_category=="Engineering",]$Income) 
## [1] 0.3296089
# Coefficient of Variation for the Education sample's incomes
sd(combined.df[combined.df$Major_category=="Education",]$Income)/
  mean(combined.df[combined.df$Major_category=="Education",]$Income) 
## [1] 0.2089347

This is reinforced by the coefficient of variation for Engineering (0.330) being roughly 158% of Education’s (0.209).
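
The comparison can be made explicit with a small helper. The sketch below is illustrative only: cv() is a hypothetical helper name, the two coefficients are copied from the output above, and the three-value example uses the 25th percentile, median, and 75th percentile incomes of the first row in the str() output.

```r
# Coefficient of variation: standard deviation relative to the mean
cv <- function(x) sd(x) / mean(x)

# Example on three income values from the first row of the str() output
example.cv <- cv(c(95000, 110000, 125000))

# Coefficients reported above for the two categories
cv.engineering <- 0.3296089
cv.education   <- 0.2089347

# Engineering's relative spread as a multiple of Education's
cv.ratio <- cv.engineering / cv.education
print(round(cv.ratio, 2))
```

Because the coefficient of variation divides by the mean, it allows the spread of a high-income category to be compared fairly against that of a low-income category.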

Summary:
The data shows that Engineering incomes can far exceed those in other categories, with Petroleum Engineering in particular standing well above the other majors. Indeed, the separation of Petroleum Engineering from the other top 10 median incomes is comparable to the separation of the top 10 from the bottom 10. However, Engineering incomes overall have a noticeably larger spread than the other categories, implying volatility either between majors or within the industries themselves. Nevertheless, students seeking high incomes may be best served by looking towards Engineering fields.

2.4 Research Question 2

Which college major should a student pursue to see the greatest prospects for employment?
Insert text and analysis.

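# Percentage of each major's jobs that require a college degree
# (a non-finite result, e.g. 0/0 for a major with no recorded jobs, is dropped by ggplot with a warning)
percentCollege = 100*gradData$College_jobs/(gradData$College_jobs+gradData$Non_college_jobs+gradData$Low_wage_jobs)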
ggplot(gradData, aes(x=factor(Major_category), y=percentCollege)) + 
  geom_boxplot(aes(fill = factor(Major_category))) +
  theme(axis.text.x = element_blank(), axis.title.x = element_blank(), axis.ticks.x = element_blank()) +
  labs(fill = "Major Category", title="Percent College Jobs per Major Category") + 
  ylab("Percentage of Jobs as College Jobs")
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).


How much of this employment is actually in their field / based on the degree (looking at college jobs vs non-college jobs)?


2.5 Research Question 3

Looking at the results from Q1 and Q2, how do these “rankings” align with the popularity of these courses? (based on total people = employed + unemployed)

Insert text and analysis.

Summary:


sessionInfo()
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] bindrcpp_0.2.2 plotly_4.8.0   ggplot2_3.1.0 
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.0         RColorBrewer_1.1-2 later_0.8.0       
##  [4] pillar_1.3.1       compiler_3.5.2     plyr_1.8.4        
##  [7] bindr_0.1.1        tools_3.5.2        digest_0.6.18     
## [10] viridisLite_0.3.0  jsonlite_1.6       evaluate_0.12     
## [13] tibble_2.0.1       gtable_0.2.0       pkgconfig_2.0.2   
## [16] rlang_0.3.1        shiny_1.2.0        crosstalk_1.0.0   
## [19] yaml_2.2.0         xfun_0.4           withr_2.1.2       
## [22] dplyr_0.7.8        stringr_1.4.0      httr_1.4.0        
## [25] knitr_1.21         htmlwidgets_1.3    grid_3.5.2        
## [28] tidyselect_0.2.5   glue_1.3.0         data.table_1.12.0 
## [31] R6_2.3.0           rmarkdown_1.11     tidyr_0.8.2       
## [34] purrr_0.3.0        magrittr_1.5       promises_1.0.1    
## [37] scales_1.0.0       htmltools_0.3.6    assertthat_0.2.0  
## [40] xtable_1.8-3       mime_0.6           colorspace_1.4-0  
## [43] httpuv_1.4.5.1     labeling_0.3       stringi_1.2.4     
## [46] lazyeval_0.2.1     munsell_0.5.0      crayon_1.3.4